Language Resource Addition: Dictionary or Corpus?
Authors
Abstract
In this paper, we investigate the relative effect of two strategies for adding language resources to Japanese word segmentation and part-of-speech tagging. The first strategy adds entries to the dictionary; the second adds annotated sentences to the training corpus. The experimental results show that adding annotated sentences to the training corpus is more effective than adding entries to the dictionary. Sentence addition is especially efficient when new words are added together with the contexts of three real occurrences, as partially annotated sentences. Based on this finding, we annotated invention disclosure texts and measured the resulting word segmentation accuracy.
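To make the contrast between the two resource types concrete, the sketch below shows one possible way to represent a dictionary entry and a partially annotated sentence. It is an illustration only, not the authors' implementation: the function name `partial_annotation`, the label scheme (1, 0, `None`), and the toy strings are all assumptions introduced here.

```python
from typing import List, Optional, Tuple

# A dictionary entry carries no context: just a surface form and a tag.
# (Hypothetical format, for illustration.)
DictionaryEntry = Tuple[str, str]
entry: DictionaryEntry = ("bcd", "noun")


def partial_annotation(sentence: str, word: str) -> List[Optional[int]]:
    """Label the gaps between adjacent characters of `sentence`.

    labels[i] describes the gap between sentence[i] and sentence[i + 1]:
    1 = word boundary, 0 = no boundary, None = left unannotated.
    Only the first occurrence of `word` is annotated; everything
    outside that occurrence stays None.
    """
    labels: List[Optional[int]] = [None] * (len(sentence) - 1)
    start = sentence.find(word)
    if start < 0:
        return labels  # word not present: nothing is annotated
    end = start + len(word)
    if start > 0:
        labels[start - 1] = 1  # boundary just before the word
    for i in range(start, end - 1):
        labels[i] = 0  # no boundary inside the word
    if end < len(sentence):
        labels[end - 1] = 1  # boundary just after the word
    return labels


# Three occurrences of the new word, each in its own context.
contexts = ["abcde", "xxbcd", "bcdyy"]
annotated = [partial_annotation(s, "bcd") for s in contexts]
```

Unlike the dictionary entry, each partially annotated sentence tells a learner where the word starts and ends in a real context, while leaving the rest of the sentence unlabelled so the annotator does not have to segment it.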
Similar resources
Dictionary of Abstract and Concrete Words of the Russian Language: A Methodology for Creation and Application
The paper describes the first stage of a project on creating an electronic dictionary with numerical estimates of the degree of abstractness and concreteness of Russian words. Our approach is to integrate data obtained from several different sources: text corpora, psycholinguistic experiments, published dictionaries, markers of abstractness (certain suffixes) and a translation of a similar dict...
Integrating Corpus Consultation in Language Studies
Alongside developments in language research, the potential of corpora as a resource in language learning and teaching has been evident to researchers and teachers since the late 1960s. Despite publications which emphasise the benefits of corpus consultation for language learners (Bernardini, 2002; Kennedy & Miceli, 2001), there is little evidence to suggest that direct corpus consultation is co...
METIS-II: Machine Translation for Low Resource Languages
In this paper we describe a machine translation prototype in which we use only minimal resources for both the source and the target language. A shallow source language analysis, combined with a translation dictionary and a mapping system of source language phenomena into the target language and a target language corpus for generation are all the resources needed in the described system. Several...
Development of a WFST based Speech Recognition System for a Resource Deficient Language Using Machine Translation
Text corpus size is an important issue when building a language model (LM) in particular where insufficient training and evaluation data are available. In this paper we continue our work on creating a speech recognition system with a LM that is trained on a small amount of text in the target language. In order to get better performance we use a large amount of foreign text and a dictionary mapp...
Integrating Dictionaries into an Unsupervised Model for Myanmar Word Segmentation
This paper addresses the problem of word segmentation for low resource languages, with the main focus being on Myanmar language. In our proposed method, we focus on exploiting limited amounts of dictionary resource, in an attempt to improve the segmentation quality of an unsupervised word segmenter. Three models are proposed. In the first, a set of dictionaries (separate dictionaries for differ...